Abstract. We investigate the conversion of one-way nondeterministic
finite automata and context-free grammars into Parikh equivalent
one-way and two-way deterministic finite automata, from a descriptional
complexity point of view.

We prove that for each one-way nondeterministic automaton with n
states there exist Parikh equivalent one-way and two-way deterministic
automata with $e^{O(\sqrt{n \cdot \ln n})}$ and $p(n)$ states,
respectively, where $p(n)$ is a polynomial. Furthermore, these costs are
tight. In contrast, if all the
words accepted by the given automaton contain at least two different 
letters, then a Parikh equivalent one-way deterministic automaton with 
a polynomial number of states can be found. 

Concerning context-free grammars, we prove that for each grammar
in Chomsky normal form with h variables there exist Parikh
equivalent one-way and two-way deterministic automata with $2^{O(h^2)}$
and $2^{O(h)}$ states, respectively. These bounds are also tight.
1 Introduction

It is well known that the state cost of converting nondeterministic
finite automata (1NFAs) into equivalent deterministic finite automata
(1DFAs) is exponential: using the classical subset construction [RS59],
from each n-state 1NFA we can build an equivalent 1DFA with $2^n$
states. Furthermore, this cost cannot be reduced.
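As an illustration of the subset construction, the following Python sketch determinizes a small 1NFA. The dictionary encoding of automata and the witness language (words over {a, b} whose k-th symbol from the end is a) are assumptions made for this example and are not taken from [RS59]. This witness has n = k + 1 states and reaches only $2^{n-1}$ subsets, whereas the optimal witnesses from the literature attain the full $2^n$ bound.

```python
from itertools import chain


def determinize(alphabet, delta, start, finals):
    """Subset construction restricted to the subsets reachable from {start}.

    delta maps (state, symbol) to a set of successor states.
    """
    start_set = frozenset([start])
    dfa_delta = {}
    seen = {start_set}
    todo = [start_set]
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            target = frozenset(
                chain.from_iterable(delta.get((q, symbol), ()) for q in current)
            )
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    dfa_finals = {s for s in seen if s & set(finals)}
    return seen, dfa_delta, start_set, dfa_finals


def nfa_kth_from_end(k):
    """Hypothetical (k + 1)-state 1NFA: the k-th symbol from the end is 'a'."""
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for i in range(1, k):
        delta[(i, "a")] = {i + 1}
        delta[(i, "b")] = {i + 1}
    return delta, 0, {k}


delta, start, finals = nfa_kth_from_end(3)
states, _, _, _ = determinize("ab", delta, start, finals)
print(len(states))  # number of reachable subsets of the 4-state 1NFA
```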

In all examples witnessing such a state gap (e.g., [Lup63, MF71, Moo71]),
input alphabets with at least two letters and proof arguments strongly
relying on the structure of words are used. As a matter of fact, in the
unary case, namely the case of a one-letter input alphabet, the cost
reduces to $e^{\Theta(\sqrt{n \cdot \ln n})}$, as shown by Chrobak [Chr86].

What happens if we do not care about the order of symbols in the
words, i.e., if we are interested only in obtaining 1DFAs accepting
sets of words which are equal, after permuting the symbols, to the
words accepted by the given 1NFAs?

This question is related to the well-known notions of Parikh image and
Parikh equivalence [Par66], which have been extensively investigated in
the literature (e.g., [Gol77, AEI02]), also for their connections with
semilinear sets [Huy80] and with other fields of investigation such as
Presburger arithmetic [GS66], Petri nets [Esp97], logical formulas
[VSS05], and formal verification [To10a].

We remind the reader that two words over the same alphabet $\Sigma$ are
Parikh equivalent if and only if they are equal up to a permutation of
their symbols or, equivalently, for each letter $a \in \Sigma$, the
number of occurrences of $a$ in the two words is the same. This notion
extends in a natural way to languages (two languages $L_1$ and $L_2$ are
Parikh equivalent when for each word in $L_1$ there is a Parikh
equivalent word in $L_2$ and vice versa) and to formal systems which are
used to specify languages such as grammars and automata. Notice that in
the unary case Parikh equivalence is just the standard equivalence. So,
in the unary case, the answer to our previous question is given by the
above-mentioned result by Chrobak.

Our first contribution in this paper is an answer to that question in the
general case. In particular, we prove that the state cost of the conversion
of n-state 1NFAs into Parikh equivalent 1DFAs is the same as in the unary
case, i.e., it is $e^{\Theta(\sqrt{n \cdot \ln n})}$. More surprisingly, we
prove that this is due to the unary parts of languages. In fact, we show
that if the given 1NFA accepts only nonunary words, i.e., each accepted
word contains at least two different letters, then we can obtain a Parikh
equivalent 1DFA with a number of states polynomial in n. Hence, while in
standard determinization the most difficult part (with respect to the state
complexity) is the nonunary one, in the "Parikh determinization" this part
becomes easy and the most complex part is the unary one.

In the second part of the paper we consider context-free grammars
(CFGs). Parikh's Theorem [Par66] states that each context-free language is
Parikh equivalent to a regular language. We study this equivalence from a
descriptional complexity point of view.¹ Recently, Esparza, Ganty, Kiefer,
and Luttenberger proved that each CFG G in Chomsky normal form with
h variables can be converted into a Parikh equivalent 1NFA with $O(4^h)$
states [EGKL11]. In [LP12] it was proved that if G generates a bounded
language then we can obtain a 1DFA with $2^{h^{O(1)}}$ states, i.e., a
number exponential in a polynomial of the number of variables. In this
paper, we extend that result by removing the restriction to bounded
languages. We also reduce the upper bound to $2^{O(h^2)}$. A milestone for
obtaining this result is the conversion of 1NFAs to Parikh equivalent 1DFAs
presented in the first part of the paper. By suitably combining that
conversion (in particular the polynomial conversion in the case of 1NFAs
accepting nonunary words) with the above-mentioned result from [EGKL11] and
with a result by Pighizzini, Shallit, and Wang [PSW02] concerning the unary
case, we prove that each context-free grammar in Chomsky normal form with h
variables can be converted into a Parikh equivalent 1DFA with $2^{O(h^2)}$
states. From the results concerning the unary case, it follows that this
bound is tight.

In this simulation as well, as in that of 1NFAs by Parikh equivalent 1DFAs,
the main contribution to the state complexity of the resulting automaton is
given by the unary part.

Finally, we consider conversions of 1NFAs and CFGs into Parikh equivalent
two-way deterministic automata (2DFAs). Because in the unary case these
conversions are less expensive than the corresponding ones into 1DFAs, we
are able to prove that each n-state 1NFA can be converted into a Parikh
equivalent 2DFA with a number of states polynomial in n, and each
context-free grammar in Chomsky normal form with h variables can be
converted into a Parikh equivalent 2DFA with a number of states exponential
in h.

¹ For an introductory survey of descriptional complexity, we refer the reader
to [GKK+02].
2 Preliminaries

We assume the reader to be familiar with the basic notions and properties
of automata and formal language theory. We recall just a few notions,
referring the reader to standard textbooks (e.g., [HU79, Sha08]) for
further details.

Let $\Sigma = \{a_1, a_2, \ldots, a_m\}$ be an alphabet of m letters. Let us
denote by $\Sigma^*$ the set of all words over $\Sigma$ including the empty
word $\varepsilon$. Given a word $w \in \Sigma^*$, $|w|$ denotes its length
and, for a letter $a \in \Sigma$, $|w|_a$ denotes the number of occurrences
of $a$ in $w$. For a word $u \in \Sigma^*$, $w$ is a prefix of $u$ if
$u = wx$ for some word $x \in \Sigma^*$. If $x \neq \varepsilon$ then $w$ is
a proper prefix of $u$. We denote by $\mathrm{Pref}(u)$ the set of all
prefixes of $u$, and for a language $L \subseteq \Sigma^*$, let

$$\mathrm{Pref}(L) = \bigcup_{u \in L} \mathrm{Pref}(u) .$$

A language L has the prefix property or, equivalently, is said to be
prefix-free if and only if for each word $x \in L$, no proper prefix of $x$
belongs to L. Given two sets $A, B$ and a function $f : A \to \Sigma^*$, we
say that $f$ has the prefix property on $B$ if and only if the language
$f(A \cap B)$ has the prefix property.
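For finite languages, the notions of $\mathrm{Pref}$ and of prefix-freeness can be checked directly. The following Python sketch is only an illustration; representing a language as a finite set of strings is an assumption made here.

```python
def prefixes(word):
    """Pref(u): all prefixes of word, including the empty word and word itself."""
    return {word[:i] for i in range(len(word) + 1)}


def pref(language):
    """Pref(L): the union of Pref(u) over all words u in L."""
    result = set()
    for u in language:
        result |= prefixes(u)
    return result


def is_prefix_free(language):
    """True iff no word of the language has a proper prefix in the language."""
    words = set(language)
    return all(not ((prefixes(w) - {w}) & words) for w in words)
```

For instance, `is_prefix_free({"ab", "ba", "aab"})` holds, while `{"a", "ab"}` is not prefix-free because `a` is a proper prefix of `ab`.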

In the paper we consider: 

• one-way deterministic and one-way nondeterministic finite automata
(abbreviated as 1DFAs and 1NFAs, respectively),

• two-way deterministic finite automata (2DFAs),

• context-free grammars (CFGs) and context-free languages (CFLs).

While in one-way automata the input is scanned from left to right until
the end of the input is reached, where the word is accepted or rejected, in
two-way automata the head can be moved in both directions. At each step,
depending on the current state and scanned input symbol and according to
the transition function, the internal state is changed and the head is moved
one position leftward, one position rightward, or kept on the same cell. In
order to locate the left and the right ends of the input, the word is given on
the tape surrounded by two special symbols, the left and right endmarkers.
We assume that a 2DFA starts the computation in a designated initial state,
scanning the first input symbol, and that its head cannot violate the
endmarkers, namely, there are no transitions reading the left (right)
endmarker and moving to the left (right, respectively). In the literature,
several slightly different acceptance conditions for two-way automata have
been considered. Here, we assume that a 2DFA accepts the input by entering a
special state $q_f$ which is also halting. We use $L(A)$ to denote the
language accepted or defined by an automaton A.

A CFG G is denoted by a quadruple $(V, \Sigma, P, S)$, where V is the set of
variables, $\Sigma$ is the terminal alphabet, P is the set of productions, and
$S \in V$ is the start variable. By $L(G)$ we denote the language generated or
defined by G, namely the set of all words in $\Sigma^*$ that have at least one
derivation by G from S. G is said to be in Chomsky normal form if all of its
productions are in one of the three simple forms $B \to CD$, $B \to a$, or
$S \to \varepsilon$, where $a \in \Sigma$, $B \in V$, and
$C, D \in V \setminus \{S\}$. CFGs in Chomsky normal form are called Chomsky
normal form grammars (CNFGs). According to the discussion in [Gru73], we
employ the number of variables of CNFGs as a "reasonable" measure of
descriptional complexity for CFLs.

A word is said to be unary if it consists of $k \geq 0$ occurrences of the
same symbol; otherwise it is said to be nonunary. A language L is unary if
$L \subseteq \{a\}^*$ for some letter $a$. In a similar way, automata and
CFGs are unary when their input and terminal alphabets, respectively,
consist of just one symbol.

Given an alphabet $\Sigma = \{a_1, a_2, \ldots, a_m\}$ and a language
$L \subseteq \Sigma^*$, the unary parts of L are the languages

$$L_1 = L \cap \{a_1\}^*, \quad L_2 = L \cap \{a_2\}^*, \quad \ldots, \quad L_m = L \cap \{a_m\}^* ,$$

while the nonunary part is the language

$$L_0 = L \setminus (L_1 \cup L_2 \cup \cdots \cup L_m) ,$$

i.e., the set which consists of all nonunary words belonging to the language L.
Clearly, $L = \bigcup_{i=0}^{m} L_i$.
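The decomposition into unary and nonunary parts can be made concrete for a finite language. In the following Python sketch the function name and the encoding (a set of strings, with the alphabet given as a string) are assumptions for illustration. Note that the empty word, if present, belongs to every unary part.

```python
def unary_and_nonunary_parts(language, alphabet):
    """Return {0: L_0, 1: L_1, ..., m: L_m} for a finite language."""
    parts = {i: set() for i in range(len(alphabet) + 1)}
    for word in language:
        letters = set(word)
        if len(letters) >= 2:
            parts[0].add(word)              # nonunary word
            continue
        for i, a in enumerate(alphabet, start=1):
            if letters <= {a}:              # word belongs to {a_i}^*
                parts[i].add(word)
    return parts
```

The union of the returned parts is the input language, in accordance with $L = \bigcup_{i=0}^{m} L_i$.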

We denote the set of integers by $\mathbb{Z}$ and the set of nonnegative
integers by $\mathbb{N}$. Then $\mathbb{Z}^m$ and $\mathbb{N}^m$ denote the
corresponding sets of m-dimensional integer vectors including the null
vector $\mathbf{0} = (0, 0, \ldots, 0)$. For $1 \leq i \leq m$, we denote the
i-th component of a vector $v$ by $v[i]$.

Given k vectors $v_1, \ldots, v_k \in \mathbb{Z}^m$, we say that they are
linearly independent if and only if for all $n_1, \ldots, n_k \in \mathbb{Z}$,
$n_1 v_1 + \cdots + n_k v_k = \mathbf{0}$ implies
$n_1 = \cdots = n_k = 0$. It is well known that, in this case, k cannot
exceed m. The following result will be used in the paper.

Lemma 2.1. Given k linearly independent vectors
$v_1, \ldots, v_k \in \mathbb{Z}^m$ there are k pairwise different integers
$t_1, \ldots, t_k \in \{1, \ldots, m\}$ such that $v_j[t_j] \neq 0$,
for $j = 1, \ldots, k$.
Proof. Let V be the $m \times k$ matrix which has $v_1, \ldots, v_k$ as
columns. Since the given vectors are linearly independent, $k \leq m$. First,
we suppose $k = m$. Then the determinant $d(V)$ of V is defined and it is
nonzero.

If $k = 1$ then the result is trivial. Otherwise, we can compute $d(V)$ along
the last column as

$$d(V) = \sum_{i=1}^{k} (-1)^{i+k} \, v_k[i] \, d_{i,k} ,$$

where $d_{i,k}$ is the determinant of the matrix $V_{i,k}$ obtained by
removing from V the row i and the column k. Since $d(V) \neq 0$, there is at
least one index i such that $v_k[i] \neq 0$ and $d_{i,k} \neq 0$. Hence, as
$t_k$ we take such i. Using induction on the matrix $V_{i,k}$, we can
finally obtain the sequence $t_1, \ldots, t_k$ satisfying the statement of
the lemma.

Finally, we observe that in the case $k < m$, by suitably deleting $m - k$
rows from V, we obtain a $k \times k$ matrix $V'$ with $d(V') \neq 0$. Thus,
we can apply the same argument to $V'$. □
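The statement of Lemma 2.1 can be explored on small inputs by exhaustive search. The Python sketch below does not follow the determinant argument of the proof; it simply tries all injective assignments of positions to vectors. The function name is an assumption for this illustration. The search may also succeed on some dependent families, since the lemma gives only a necessary condition for independence.

```python
from itertools import permutations


def distinct_nonzero_positions(vectors):
    """Return pairwise different 1-based positions t_1, ..., t_k with
    vectors[j][t_j] != 0 for every j, or None if no such choice exists."""
    k, m = len(vectors), len(vectors[0])
    for choice in permutations(range(m), k):
        if all(vectors[j][choice[j]] != 0 for j in range(k)):
            return [t + 1 for t in choice]
    return None
```

For the linearly independent vectors (1, 1, 0) and (1, 0, 0), the only valid choice is $t_1 = 2$, $t_2 = 1$; for the dependent vectors (1, 0) and (2, 0) no choice exists.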

A vector $v \in \mathbb{Z}^m$ is unary if it contains at most one nonzero
component, i.e., $v[i], v[j] \neq 0$ for some $1 \leq i, j \leq m$ implies
$i = j$; otherwise, it is nonunary. By definition, the null vector is unary.

In the sequel, we reserve $\preceq$ for the componentwise partial order on
$\mathbb{N}^m$, i.e., $u \preceq v$ if and only if $u[k] \leq v[k]$ for all
$1 \leq k \leq m$. For a vector $v \in \mathbb{N}^m$, let

$$\mathrm{Pred}(v) = \{u \mid u \preceq v\} .$$

For $u, v \in \mathbb{N}^m$, $v - u$ is defined to be the vector $w$ with
$w[k] = v[k] - u[k]$ for all $1 \leq k \leq m$. Note that $v - u$ is a vector
in $\mathbb{N}^m$ if and only if $u \preceq v$.

A linear set in $\mathbb{N}^m$ is a set of the form

$$\{v_0 + n_1 v_1 + n_2 v_2 + \cdots + n_k v_k \mid n_1, n_2, \ldots, n_k \in \mathbb{N}\} , \qquad (1)$$

where $k \geq 0$ and $v_0, v_1, v_2, \ldots, v_k \in \mathbb{N}^m$. The vector
$v_0$ is called the offset, while the vectors $v_1, \ldots, v_k$ are called
generators. A semilinear set in $\mathbb{N}^m$ is a finite union of linear
sets in $\mathbb{N}^m$.

The Parikh map $\psi : \Sigma^* \to \mathbb{N}^m$ associates with a word
$w \in \Sigma^*$ the vector

$$\psi(w) = (|w|_{a_1}, |w|_{a_2}, \ldots, |w|_{a_m}) ,$$

which counts the occurrences of each letter of $\Sigma$ in $w$. The vector
$\psi(w)$ is also called the Parikh image of $w$. Notice that a word
$w \in \Sigma^*$ is unary if and only if its Parikh image $\psi(w)$ is a
unary vector. One can naturally generalize this map to a language
$L \subseteq \Sigma^*$ as

$$\psi(L) = \{\psi(w) \mid w \in L\} .$$

The set $\psi(L)$ is called the Parikh image of L. Two languages
$L, L' \subseteq \Sigma^*$ are said to be Parikh equivalent to each other if
and only if $\psi(L) = \psi(L')$.
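For finite languages the Parikh map and Parikh equivalence can be computed directly. The following Python sketch is a minimal illustration; the function names and the encoding of the alphabet as a string of letters are assumptions made here.

```python
from collections import Counter


def parikh(word, alphabet):
    """Parikh image of a word: occurrences of each letter, in alphabet order."""
    counts = Counter(word)
    return tuple(counts[a] for a in alphabet)


def parikh_image(language, alphabet):
    """Parikh image of a finite language."""
    return {parikh(w, alphabet) for w in language}


def parikh_equivalent(first, second, alphabet):
    """True iff the two finite languages have the same Parikh image."""
    return parikh_image(first, alphabet) == parikh_image(second, alphabet)
```

For example, `{"ab", "aab"}` and `{"ba", "aba"}` are Parikh equivalent over the alphabet `"ab"`, although they share no word.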

Parikh equivalence can be defined not only between languages but also among
languages, grammars, and finite automata by referring, in the last two cases,
to the defined languages. For example, given a language L, a CFG G, and a
finite automaton A, we say that:

• G is Parikh equivalent to L if and only if $\psi(L(G)) = \psi(L)$,

• A is Parikh equivalent to L if and only if $\psi(L(A)) = \psi(L)$,

• G is Parikh equivalent to A if and only if $\psi(L(G)) = \psi(L(A))$.

Parikh's Theorem, proven in 1966 [Par66], states that the Parikh image
of any context-free language is a semilinear set. Since the class of regular
languages is closed under union and each linear set as in (1) is the Parikh
image of the regular language

$$\{w_0\} \cdot \{w_1, w_2, \ldots, w_k\}^* ,$$

where, for $i = 0, \ldots, k$,
$w_i = a_1^{v_i[1]} a_2^{v_i[2]} \cdots a_m^{v_i[m]}$, Parikh's Theorem is
frequently formulated by stating the following immediate consequence:

Theorem 2.2 ([Par66]). Every context-free language is Parikh equivalent
to a regular language.
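The correspondence between a linear set as in (1) and the regular language $\{w_0\} \cdot \{w_1, \ldots, w_k\}^*$ can be checked on a bounded fragment. The Python sketch below enumerates both sides up to a size bound and compares them; the function names, the bound, and the sample offset and generators are assumptions chosen for illustration.

```python
def word_of(vector, alphabet):
    """The word a_1^{v[1]} a_2^{v[2]} ... a_m^{v[m]}."""
    return "".join(letter * count for letter, count in zip(alphabet, vector))


def parikh(word, alphabet):
    return tuple(word.count(letter) for letter in alphabet)


def linear_set_up_to(offset, generators, max_sum):
    """Vectors of the linear set whose components sum to at most max_sum."""
    found, todo = set(), [tuple(offset)]
    while todo:
        v = todo.pop()
        if v in found or sum(v) > max_sum:
            continue
        found.add(v)
        todo.extend(
            tuple(x + y for x, y in zip(v, g)) for g in generators if any(g)
        )
    return found


def words_up_to(offset, generators, alphabet, max_len):
    """Words of {w_0}{w_1, ..., w_k}^* of length at most max_len."""
    pieces = [word_of(g, alphabet) for g in generators if any(g)]
    found, todo = set(), [word_of(offset, alphabet)]
    while todo:
        w = todo.pop()
        if w in found or len(w) > max_len:
            continue
        found.add(w)
        todo.extend(w + p for p in pieces)
    return found


offset, generators = (1, 0), [(1, 1), (0, 2)]
vectors = linear_set_up_to(offset, generators, 7)
words = words_up_to(offset, generators, "ab", 7)
same_image = {parikh(w, "ab") for w in words} == vectors
```

Here the length of a word equals the component sum of its Parikh image, so the same bound can be used on both sides.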

Clearly, two unary languages are Parikh equivalent if and only if they
are equal. Hence, as a consequence of Theorem 2.2, each unary context-free
language is regular. This result, which was first discovered in 1962 by
Ginsburg and Rice [GR62], earlier than Parikh's Theorem, was studied
from the descriptional complexity point of view in 2002 by Pighizzini,
Shallit, and Wang [PSW02], who proved the following:

Theorem 2.3 ([PSW02, Thms. 4, 6]). For any CNFG with h variables that
generates a unary language, there exist an equivalent 1NFA with at most
$2^{2h-1} + 1$ states and an equivalent 1DFA with fewer than $2^{h^2}$ states.

In the paper we will also make use of the transformation of unary 1NFAs
into 1DFAs, whose cost was obtained in 1986 by Chrobak [Chr86]:

Theorem 2.4 ([Chr86]). The state cost of the conversion of n-state unary
1NFAs into equivalent 1DFAs is $e^{\Theta(\sqrt{n \cdot \ln n})}$.
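The growth rate in Theorem 2.4 is commonly explained through Landau's function, i.e., the largest least common multiple of positive integers whose sum is at most n, which is known to grow as $e^{\Theta(\sqrt{n \cdot \ln n})}$. This connection is background knowledge rather than something spelled out in this section. The Python sketch below computes the function by a simple dynamic program; the name `landau` is an assumption for this illustration.

```python
from math import gcd


def landau(n):
    """Largest lcm of distinct positive integers whose sum is at most n.

    Repeating a part never increases the lcm, so distinct parts suffice.
    """
    # reachable[s] holds every lcm obtainable from distinct parts summing to s.
    reachable = [set() for _ in range(n + 1)]
    reachable[0].add(1)
    for part in range(1, n + 1):
        for total in range(n, part - 1, -1):
            for value in reachable[total - part]:
                reachable[total].add(value * part // gcd(value, part))
    return max(max(values) for values in reachable if values)
```

For n = 17 the maximum is reached by the parts 2, 3, 5, 7, whose least common multiple is 210.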
2.1 Preliminary Constructions

Here, we present some preliminary constructions which will be used in the
rest of the paper. These constructions are simple and standard. They are
given just for the sake of completeness. Hence, the experienced reader can
skip this part, going directly to the following sections.

First, we consider some decomposition results: we show how to obtain
automata and grammars for the unary and nonunary parts of the languages
defined by given automata and grammars, respectively. Afterwards, we briefly
discuss some composition results: how to obtain 1DFAs or 2DFAs,
respectively, accepting the union of languages defined by given 1DFAs or
2DFAs.

Throughout the section, let us fix an m-symbol alphabet
$\Sigma = \{a_1, a_2, \ldots, a_m\}$. Let us start by considering finite
automata.

Lemma 2.5. For each n-state 1NFA A accepting a language
$L(A) \subseteq \Sigma^*$, there exist $m + 1$ 1NFAs $A_0, A_1, \ldots, A_m$
such that:

• $A_0$ has $n(m + 1) + 1$ states and accepts the nonunary part of $L(A)$.

• For $i = 1, \ldots, m$, $A_i$ is a unary 1NFA with n states which accepts the
unary part $L(A) \cap \{a_i\}^*$.

Furthermore, if A is deterministic then $A_0, A_1, \ldots, A_m$ are
deterministic, too.

Proof. To accept an input $w$, the automaton $A_0$ has to check that $w$ is
accepted by A and contains at least two different symbols. To do that, $A_0$
uses the same states and transitions as A. However, in a preliminary phase,
it keeps track in its finite control of the first letter of $w$, until
discovering a different letter.

The automaton $A_0$, besides all states and transitions of A, has a new
initial state and m extra copies $[q, 1], \ldots, [q, m]$ of each state q of
A. The transitions from the initial state of $A_0$ simulate those from the
initial state of A, also remembering the first symbol of the input, i.e., a
transition in A which from the initial state, reading a symbol $a_i$, leads
to the state q, is simulated in $A_0$ by a transition leading to $[q, i]$.

From a state $[q, i]$, reading the same symbol $a_i$, the automaton $A_0$ can
move to each state $[p, i]$ such that A from q reading $a_i$ can move to p. In
this way, as long as the scanned input prefix consists only of occurrences of
the same letter $a_i$, $A_0$ simulates A using the i-th copies of the states.
However, when in a state $[q, i]$ a symbol $a_j \neq a_i$ is read, having
verified that the input contains at least two different letters, $A_0$ can
move to each state p which is reachable in A from the state q reading $a_j$,
so entering the part of $A_0$ corresponding to the original A. The final
states of $A_0$ are the final states in the copy of A. From this
construction we see that the number of states of $A_0$ is $n(m + 1) + 1$.
Furthermore, if A is deterministic then $A_0$ is also deterministic.

For the unary parts, it is easily seen that for $i = 1, \ldots, m$, the
automaton $A_i$ can be obtained by removing from A all the transitions on the
symbols $a_j \neq a_i$. Clearly, this construction also preserves
determinism. □
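The construction of $A_0$ in the proof above can be sketched in Python as follows. The dictionary encoding of automata, the name of the fresh initial state, and the sample 1NFA (words over {a, b} ending in a) are assumptions made for this illustration. The final loop compares $A_0$ with the intended language on all words of length at most 5.

```python
from itertools import product


def nonunary_part_nfa(states, alphabet, delta, start, finals):
    """Build A_0 from A; delta maps (state, symbol) to a set of states."""
    initial = "init"  # fresh initial state, assumed not to clash with A's states
    new_delta = {}

    def add(source, symbol, target):
        new_delta.setdefault((source, symbol), set()).add(target)

    for (q, a), targets in delta.items():            # untouched copy of A
        for p in targets:
            add(q, a, p)
    for i, first in enumerate(alphabet, start=1):
        for p in delta.get((start, first), ()):       # remember the first letter
            add(initial, first, (p, i))
        for q in states:
            for a in alphabet:
                for p in delta.get((q, a), ()):
                    # same letter: stay in the i-th copy; new letter: enter A
                    add((q, i), a, (p, i) if a == first else p)
    new_states = [initial] + list(states) + [
        (q, i) for q in states for i in range(1, len(alphabet) + 1)
    ]
    return new_states, new_delta, initial, set(finals)


def accepts(delta, start, finals, word):
    current = {start}
    for a in word:
        current = {p for q in current for p in delta.get((q, a), ())}
    return bool(current & set(finals))


# Hypothetical 2-state 1NFA for the words over {a, b} that end in 'a'.
states, alphabet = [0, 1], "ab"
delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
a0_states, a0_delta, a0_start, a0_finals = nonunary_part_nfa(
    states, alphabet, delta, 0, {1}
)
mismatches = [
    "".join(w)
    for n in range(6)
    for w in product(alphabet, repeat=n)
    if accepts(a0_delta, a0_start, a0_finals, "".join(w))
    != (accepts(delta, 0, {1}, "".join(w)) and len(set(w)) >= 2)
]
```

In this example $A_0$ has $n(m + 1) + 1 = 7$ states, matching the count stated in Lemma 2.5.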

We can give a similar result in the case of CFGs. 

Lemma 2.6. For each h-variable CNFG G generating a language
$L(G) \subseteq \Sigma^*$, there exist $m + 1$ CNFGs $G_0, G_1, \ldots, G_m$
such that:

• $G_0$ has $mh - m + 1$ variables and generates the nonunary part $L_0$ of $L(G)$.

• For $i = 1, \ldots, m$, $G_i$ is a unary CNFG with h variables which generates
the unary part $L_i = L(G) \cap \{a_i\}^*$.

Proof. For $i = 1, \ldots, m$, the grammar $G_i$ is obtained simply by deleting
from P all productions of the form $B \to a_j$ with $i \neq j$. Built in this
manner, $G_i$ cannot contain more than h variables.

Giving such a linear upper bound on the number of variables for $G_0$ is
slightly more involved. It is clear that any production of one letter or
$\varepsilon$ directly from S is contrary to the purpose of $G_0$. This
observation enables us to focus on the derivations by G that begin by
replacing S with two variables. Consider a derivation

$$S \Rightarrow_G BC \Rightarrow^*_G uC \Rightarrow^*_G uv$$

for some nonempty words $u, v \in \Sigma^+$ and $S \to BC \in P$. $G_0$
simulates G, but also requires an extra feature to test whether u and v
contain, respectively, letters $a_i$ and $a_j$, for some $i \neq j$, and to
make valid only the derivations that pass this test. To this end, we let the
start variable $S'$ of $G_0$ guess which of two distinct letters in $\Sigma$
have to be derived from B and C, respectively. We encode this guess into the
variables in $V' \setminus \{S'\}$ as a subscript like $B_i$ (this means
that, for $w \in \Sigma^*$, $B_i \Rightarrow^*_{G_0} w$ if and only if
$B \Rightarrow^*_G w$ and $w$ contains at least one $a_i$).

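Now, we give a formal definition of $G_0$ as a quadruple
$(V', \Sigma, P', S')$, where

$$V' = \{S'\} \cup \{B_i \mid B \in V \setminus \{S\},\ 1 \leq i \leq m\}$$

and $P'$ consists of the following production rules:

1. $\{S' \to B_i C_j \mid S \to BC \in P \text{ and } 1 \leq i, j \leq m \text{ with } i \neq j\}$;

2. $\{B_i \to C_i D_j,\ B_i \to C_j D_i \mid B \to CD \in P \text{ with } B \neq S \text{ and } 1 \leq i, j \leq m\}$;

3. $\{B_i \to a_i \mid B \to a_i \in P \text{ and } 1 \leq i \leq m\}$.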

We conclude the proof by checking that
$L(G_0) = \{w \in L(G) \mid w \text{ is not unary}\}$. To this aim we prove
the following:

Claim. Let $B_i$ be a variable of $G_0$ that is different from the start
variable. For $w \in \Sigma^*$, $B_i \Rightarrow^*_{G_0} w$ if and only if
$B \Rightarrow^*_G w$ and $w$ contains at least one occurrence of $a_i$.

Both implications will be proved using induction on the length of derivations.

(Only-if part): If $B_i \Rightarrow_{G_0} w$ (single-step derivation), then
$w$ must be $a_i$ and $B \to a_i \in P$ according to the type-3 productions
in $P'$. Hence, the base case is correct. The longer derivations must begin
with either $B_i \to C_i D_j$ or $B_i \to C_j D_i$ for some $B \to CD \in P$
and some $1 \leq j \leq m$. It is enough to investigate the former case (the
other one is completely similar). Then we have

$$B_i \Rightarrow_{G_0} C_i D_j \Rightarrow^*_{G_0} w_1 D_j \Rightarrow^*_{G_0} w_1 w_2 = w$$

for some $w_1, w_2 \in \Sigma^+$. By induction hypothesis,
$C \Rightarrow^*_G w_1$, $w_1$ contains $a_i$, and $D \Rightarrow^*_G w_2$.
Hence,

$$B \Rightarrow_G CD \Rightarrow^*_G w_1 D \Rightarrow^*_G w_1 w_2 = w$$

is a valid derivation by G, and $w$ contains $a_i$.

(If part): The base case is proved as for the direct implication. If
$B \Rightarrow^*_G w$ is not a single-step derivation, then it must start by
applying to B some production $B \to CD \in P$. Namely,

$$B \Rightarrow_G CD \Rightarrow^*_G w'_1 D \Rightarrow^*_G w'_1 w'_2 = w$$

for some nonempty words $w'_1, w'_2 \in \Sigma^+$. Thus, either $w'_1$ or
$w'_2$ contains $a_i$; let us say $w'_1$ does (the other case is similar). By
induction hypothesis, $C_i \Rightarrow^*_{G_0} w'_1$. A letter $a_j$
occurring in $w'_2$ is chosen, and the hypothesis gives
$D_j \Rightarrow^*_{G_0} w'_2$. As a result, the derivation

$$B_i \Rightarrow_{G_0} C_i D_j \Rightarrow^*_{G_0} w'_1 D_j \Rightarrow^*_{G_0} w'_1 w'_2 = w$$

is valid.

This completes the proof of the claim.

To conclude the proof of the lemma, let us check that $G_0$ generates the
nonunary part of $L(G)$. For the direct implication, assume that
$u \in L(G_0)$. Its derivation should be

$$S' \Rightarrow_{G_0} B_i C_j \Rightarrow^*_{G_0} u_1 C_j \Rightarrow^*_{G_0} u_1 u_2 = u$$

for some $S' \to B_i C_j \in P'$, $u_1, u_2 \in \Sigma^+$,
$1 \leq i, j \leq m$, with $i \neq j$. Ignoring the subscripts $i, j$ in this
derivation gives $S \Rightarrow^*_G u$. Moreover, the claim above implies
that $u_1$ and $u_2$ contain $a_i$ and $a_j$, respectively. Thus, $u$
is a nonunary word in $L(G)$.

Conversely, consider a nonunary word $w \in L(G)$. Since $w$ is nonunary,
$|w| \geq 2$, and this means that its derivation by G must begin with a
production $S \to BC$. Since $B, C \neq S$, they cannot produce
$\varepsilon$, and hence, we have

$$S \Rightarrow_G BC \Rightarrow^*_G w_1 C \Rightarrow^*_G w_1 w_2 = w$$

for some nonempty words $w_1, w_2 \in \Sigma^+$. Again, since $w$ is nonunary,
we can find a letter $a_i$ in $w_1$ and a letter $a_j$ in $w_2$ such that
$i \neq j$. Now, the claim above implies $B_i \Rightarrow^*_{G_0} w_1$ and
$C_j \Rightarrow^*_{G_0} w_2$. Since $S' \to B_i C_j \in P'$, the derivation

$$S' \Rightarrow_{G_0} B_i C_j \Rightarrow^*_{G_0} w_1 C_j \Rightarrow^*_{G_0} w_1 w_2 = w$$

is a valid one by $G_0$.

Note that, being thus designed, $G_0$ contains $mh - m + 1$ variables. □
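The three groups of productions of $G_0$ can be generated mechanically. The following Python sketch builds $G_0$ from a CNFG and compares, up to a length bound, the generated language with the nonunary part of $L(G)$. The encoding of productions as pairs (head, body), the pair used for $S'$, and the sample grammar are assumptions made for this illustration.

```python
def nonunary_part_grammar(variables, terminals, productions, start):
    """Build G_0. A production is (head, body); body is (C, D), (a,) or ()."""
    indices = range(1, len(terminals) + 1)
    new_start = (start, "prime")                      # plays the role of S'
    new_variables = [new_start] + [
        (B, i) for B in variables if B != start for i in indices
    ]
    new_productions = []
    for head, body in productions:
        if len(body) == 2 and head == start:          # productions of type 1
            C, D = body
            new_productions += [
                (new_start, ((C, i), (D, j)))
                for i in indices for j in indices if i != j
            ]
        elif len(body) == 2:                          # productions of type 2
            C, D = body
            for i in indices:
                for j in indices:
                    new_productions.append(((head, i), ((C, i), (D, j))))
                    new_productions.append(((head, i), ((C, j), (D, i))))
        elif len(body) == 1 and head != start:        # productions of type 3
            i = terminals.index(body[0]) + 1
            new_productions.append(((head, i), body))
    return new_variables, new_productions, new_start


def words_up_to(productions, start, max_len):
    """All generated words of length <= max_len, by fixed-point iteration."""
    language = {}
    changed = True
    while changed:
        changed = False
        for head, body in productions:
            if len(body) == 2:
                left = language.get(body[0], set())
                right = language.get(body[1], set())
                new = {u + v for u in left for v in right if len(u + v) <= max_len}
            else:
                new = {"".join(body)}
            known = language.setdefault(head, set())
            if not new <= known:
                known |= new
                changed = True
    return language.get(start, set())


# Hypothetical CNFG over {a, b}: S -> AB | AA | a, A -> AA | a, B -> BA | b.
variables, terminals = ["S", "A", "B"], "ab"
productions = [
    ("S", ("A", "B")), ("S", ("A", "A")), ("S", ("a",)),
    ("A", ("A", "A")), ("A", ("a",)),
    ("B", ("B", "A")), ("B", ("b",)),
]
g0_variables, g0_productions, g0_start = nonunary_part_grammar(
    variables, terminals, productions, "S"
)
generated = words_up_to(g0_productions, g0_start, 6)
expected = {w for w in words_up_to(productions, "S", 6) if len(set(w)) >= 2}
```

With h = 3 and m = 2, the resulting grammar has $mh - m + 1 = 5$ variables.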

We conclude this section by briefly discussing some constructions related
to the union of languages defined by 1DFAs and by 2DFAs.

First, we remind the reader that k 1DFAs $A_1, A_2, \ldots, A_k$ with
$n_1, n_2, \ldots, n_k$ states, respectively, can be simulated "in parallel"
by a 1DFA A, in order to recognize the union
$L(A_1) \cup L(A_2) \cup \cdots \cup L(A_k)$. In particular, the state set
of A is the Cartesian product of the state sets of the given automata. For this
reason, the automaton A obtained according to this standard construction
is usually called the product automaton. Its number of states is
$n_1 \cdot n_2 \cdots n_k$.
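The product construction for the union can be sketched in a few lines of Python. The tuple encoding of complete 1DFAs and the two sample unary automata are assumptions made for this illustration.

```python
from itertools import product


def product_dfa(dfas, alphabet):
    """Union of complete 1DFAs given as (states, delta, start, finals)."""
    states = list(product(*(d[0] for d in dfas)))
    delta = {
        (qs, a): tuple(d[1][(q, a)] for d, q in zip(dfas, qs))
        for qs in states for a in alphabet
    }
    start = tuple(d[2] for d in dfas)
    # A product state is final when at least one component is final.
    finals = {qs for qs in states if any(q in d[3] for d, q in zip(dfas, qs))}
    return states, delta, start, finals


def run(dfa, word):
    _, delta, state, finals = dfa
    for a in word:
        state = delta[(state, a)]
    return state in finals


def not_multiple_of(k):
    """Hypothetical unary 1DFA accepting a^n with n mod k != 0."""
    states = list(range(k))
    return states, {(q, "a"): (q + 1) % k for q in states}, 0, set(range(1, k))


union = product_dfa([not_multiple_of(2), not_multiple_of(3)], "a")
```

The product of the 2-state and the 3-state automaton has $2 \cdot 3 = 6$ states and accepts $a^n$ exactly when n is not a multiple of 6.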

If $A_1, A_2, \ldots, A_k$ are 2DFAs and we want to obtain a 2DFA A accepting
the union $L(A_1) \cup L(A_2) \cup \cdots \cup L(A_k)$, the state cost reduces
to the sum $n_1 + n_2 + \cdots + n_k$, under the hypothesis that the automata
are halting, namely, they do not have any infinite computation.
In particular, on input $w$, the automaton A simulates in sequence, for
$i = 1, \ldots, k$, the automata $A_i$, halting and accepting as soon as an
index i is found such that $A_i$ accepts $w$.

Suppose that, for $i = 1, \ldots, k$, the state set of $A_i$ is $Q_i$ with
final state $q_{i,f}$, and $Q_i \cap Q_j = \emptyset$ for $i \neq j$. Then, A
can be defined as follows.

• The set of states is $Q = Q_1 \cup Q_2 \cup \cdots \cup Q_k$.

• The initial state is the initial state of $A_1$.

• The final state is the final state $q_{k,f}$ of $A_k$.

• For $i = 1, \ldots, k - 1$, A contains all the transitions of $A_i$ with the
exception of those leading to the final state $q_{i,f}$ of $A_i$. Those
transitions lead directly to $q_{k,f}$, to halt and accept.

In this way, the state $q_{i,f}$ of $A_i$ becomes unreachable. (We remind
the reader that this state is also halting.) In the automaton A, the
state $q_{i,f}$ is "recycled" in a different way: it is used to prepare the
simulation of $A_{i+1}$ after a nonaccepting simulation of $A_i$. To this
aim, each undefined transition of $A_i$ leads in A to the state $q_{i,f}$,
where the automaton A loops, moving the head leftward, to reach the left
endmarker. There, A moves the head one position to the right, onto the
first symbol of the input word, and enters the initial state of $A_{i+1}$,
hence starting to simulate it.

• All the transitions of $A_k$ are copied in A without any change. Hence,
if the input was rejected in all the simulations of $A_1, A_2, \ldots, A_{k-1}$,
it is accepted by A if and only if it is accepted by $A_k$.

We observe that each 1DFA can be converted into a 2DFA in the form we
are considering (cf. the acceptance condition fixed in Section 2), just by
adding the accepting state, which is entered on the right endmarker when the
given 1DFA accepts the input. So the above construction works (with the
addition of at most k extra states) even when some of the $A_i$'s are
one-way.

Finally, we point out that, as proven in [GMP07], with a linear increase
in the number of states, each 2DFA can be made halting. In particular,
each n-state 2DFA can be simulated by a halting 2DFA with 4n states.

So the construction outlined above can be extended to the case of
nonhalting 2DFAs by obtaining a 2DFA with no more than
$4 \cdot (n_1 + n_2 + \cdots + n_k)$ states.
3 From 1NFAs to Parikh equivalent 1DFAs

In this section we present our first main contribution. Having fixed an
alphabet $\Sigma = \{a_1, a_2, \ldots, a_m\}$, from each n-state 1NFA A with
input alphabet $\Sigma$, we derive a Parikh equivalent 1DFA $A'$ with
$e^{O(\sqrt{n \cdot \ln n})}$ states. Furthermore, we prove that this cost
is tight.

Actually, as a preliminary step, we obtain a result which is interesting
per se: if each word accepted by the given 1NFA A contains at least two
different symbols, i.e., it is nonunary, then the Parikh equivalent 1DFA
$A'$ can be obtained with polynomially many states. Hence, the
superpolynomial blowup is due to the unary part of the accepted language.
This result (presented in Theorem 3.3) looks quite surprising. Hence,
before starting the technical presentation, we show an example with the aim
of giving, in a very simple case, a taste of our constructions.

Example 3.1. Let us consider the following language

$$L = \{b a^n \mid n \bmod 210 \neq 0\} .$$

Clearly, L does not contain any unary word. Furthermore, it can be verified
that L is accepted by the 18-state 1NFA A in Fig. 1 (Left). In particular, in
the initial state, reading the letter b, A nondeterministically chooses
to verify the membership of the input in one of the following languages:

• $L_1 = \{b a^n \mid n \bmod 2 \neq 0\}$,

• $L_2 = \{b a^n \mid n \bmod 3 \neq 0\}$,

• $L_3 = \{b a^n \mid n \bmod 5 \neq 0\}$,

• $L_4 = \{b a^n \mid n \bmod 7 \neq 0\}$.

Of course, $L = L_1 \cup L_2 \cup L_3 \cup L_4$. The automaton A can be
transformed into an equivalent 1DFA by identifying the transitions leaving
the initial state and by merging the 4 loops into a unique loop of length
$2 \cdot 3 \cdot 5 \cdot 7 = 210$. Using standard distinguishability
arguments, it can be shown that it is not possible to do better. As a matter
of fact, the smallest complete 1DFA accepting L requires 212 states.

However, we can build a complete 1DFA $A'$ with only 22 states, accepting
a language $L'$ Parikh equivalent to L. To do that, for $i = 1, \ldots, 4$,
we replace each language $L_i$ with a Parikh equivalent language $L'_i$ in
such a way that all the words in $L'_i$ begin with the prefix $a^{i-1} b$,
and then we define $L'$ as the union of the resulting languages, namely:
• $L'_1 = \{b a^n \mid n \bmod 2 \neq 0\} = L_1$,

• $L'_2 = \{a b a^{n-1} \mid n \bmod 3 \neq 0\}$,

• $L'_3 = \{a^2 b a^{n-2} \mid n \bmod 5 \neq 0\}$,

• $L'_4 = \{a^3 b a^{n-3} \mid n \bmod 7 \neq 0\}$,

• $L' = L'_1 \cup L'_2 \cup L'_3 \cup L'_4$.

Figure 1: (Left) The 1NFA A of Example 3.1. (Right) The Parikh equivalent
1DFA $A'$.


In this way, given an input word $w$, after reading the first 4 input symbols,
$A'$ can decide deterministically for which language $L'_i$,
$1 \leq i \leq 4$, to test the membership of $w$ in order to decide whether
or not $w \in L'$.

The automaton $A'$ is depicted in Fig. 1 (Right). The vertical path starting
from the initial state is used to select, depending on the position of the
letter b, one of the loops, i.e., to select which language $L'_i$ must be
used to decide the membership of the input in $L'$. (Of course, when the
symbol b does not appear in the prefix of length 4, the automaton rejects by
entering a dead state, which is not depicted.)

The loops of $A'$ are obtained by suitably "unrolling" the loops of the
original 1NFA A. The unrolled parts of the loops are moved before the
b-transitions and merged together in the vertical path which starts from the
initial state. □
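The Parikh equivalence of $L$ and $L'$ in Example 3.1 can be checked on a bounded fragment. In the Python sketch below, membership in $L'$ is decided the way $A'$ does it, namely by looking at the position of the letter b; the function names and the bound are assumptions made for this illustration. The sketch is consistent with a count of 4 states on the vertical path, 2 + 3 + 5 + 7 = 17 states in the loops, and one dead state, i.e., 22 states in total.

```python
MODULI = (2, 3, 5, 7)


def in_L(word):
    """Membership in L = { b a^n | n mod 210 != 0 }."""
    return (
        word[:1] == "b"
        and set(word[1:]) <= {"a"}
        and (len(word) - 1) % 210 != 0
    )


def in_L_prime(word):
    """Membership in L' for words over {a, b}: the position of b selects the loop."""
    position = word.find("b")
    if position < 0 or position >= len(MODULI) or "b" in word[position + 1:]:
        return False                          # corresponds to the dead state
    number_of_as = len(word) - 1
    return number_of_as % MODULI[position] != 0


def parikh_images(max_n):
    """Parikh images (|w|_a, |w|_b) of the words of L and L' with at most max_n a's."""
    image, image_prime = set(), set()
    for n in range(max_n + 1):
        if in_L("b" + "a" * n):
            image.add((n, 1))
        for i in range(min(n, len(MODULI) - 1) + 1):
            if in_L_prime("a" * i + "b" + "a" * (n - i)):
                image_prime.add((n, 1))
    return image, image_prime
```

Calling `parikh_images(2000)` returns two equal sets, which agrees with the claim of the example on this fragment.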
Lemma 3.2. There exists a polynomial p such that for each n-state 1NFA
A over $\Sigma$, the Parikh image of the language accepted by A can be
written as

$$\psi(L(A)) = Y \cup \bigcup_{i \in I} Z_i , \qquad (2)$$

where:

• $Y \subseteq \mathbb{N}^m$ is a finite set of vectors whose components are
bounded by $p(n)$;

• $I$ is a set of at most $p(n)$ indices;

• for each $i \in I$, $Z_i \subseteq \mathbb{N}^m$ is a linear set of the form:

$$Z_i = \{v_{i,0} + n_1 v_{i,1} + n_2 v_{i,2} + \cdots + n_{k_i} v_{i,k_i} \mid n_1, n_2, \ldots, n_{k_i} \in \mathbb{N}\} , \qquad (3)$$

with:

- $0 \leq k_i \leq m$,

- the components of the offset $v_{i,0}$ are bounded by $p(n)$,

- the generators $v_{i,1}, v_{i,2}, \ldots, v_{i,k_i}$ are linearly independent
vectors from $\{0, 1, \ldots, n\}^m$.

Furthermore, if all the words in $L(A)$ are nonunary then for each $i \in I$ we
can choose a nonunary vector $x_i \in \mathrm{Pred}(v_{i,0})$ such that all
those chosen vectors are pairwise distinct.

Proof. In [KT10, Thms. 7 and 8] it was proved that $\psi(L(A))$ can be written
as claimed in the first part of the statement of the lemma, with
$Y = \emptyset$, $I$ of size polynomial in n, and the components of each
offset $v_{i,0}$ bounded by $O(n^{2m} m^{m/2})$.²

Now we prove the second part of the statement. Hence, let us suppose
that all the words in $L(A)$ are nonunary. Notice that this implies that
all the offsets $v_{i,0}$ are also nonunary.

² For the sake of completeness, we derive a rough upper bound for the cardinality of the
set $I$ by counting the number of possible combinations of offsets and generators satisfying
the given limitations. Since the components of the offsets are bounded by
$O(n^{2m} m^{m/2})$, the number of possible different offsets is
$O(n^{2m^2} m^{m^2/2})$. Furthermore, there are $(n+1)^m$ vectors in
$\{0, 1, \ldots, n\}^m$. Hence, $(n+1)^{mk}$ is an upper bound for the number of
possible sets of k generators, with $k = 1, \ldots, m$. Combining these counts
gives an upper bound for the cardinality of $I$ that is polynomial in n for
fixed m. We point out that in [To10b, Thm. 4.1] slightly different bounds
have been given. However, nothing is said there about linear independence of
generators.
If for each $i \in I$ we can choose $x_i \in \mathrm{Pred}(v_{i,0})$ such that
all the $x_i$'s are pairwise different, then the proof is completed.

Otherwise, we proceed as follows. For a vector $v$, let us denote by $\|v\|$
its infinity norm, i.e., the value of its maximum component. Let us suppose
$I \subseteq \mathbb{N}$ and denote by $N_I$ the maximum element of $I$.

By proceeding in increasing order, for $i \in I$ we choose a nonunary vector
$x_i \in \mathrm{Pred}(v_{i,0})$ such that $\|x_i\| \leq i$ and $x_i$ is
different from all already chosen $x_j$, i.e., $x_i \neq x_j$ for all
$j \in I$ with $j < i$. The extra condition $\|x_i\| \leq i$ will turn out to
be useful later.

When for an $i \in I$ it is not possible to find such $x_i$, we replace $Z_i$ by
some suitable sets. Essentially, those sets are obtained by enlarging the
offsets using sufficiently long "unrollings" of the generators. In particular,
for $j = 1, \ldots, k_i$, we consider the set

$Z_{N_I+j} = \{ (v_{i,0} + h_j v_{i,j}) + n_1 v_{i,1} + \cdots + n_{k_i} v_{i,k_i} \mid n_1, \ldots, n_{k_i} \in \mathbb{N} \}$,   (4)

where $h_j$ is an integer satisfying the inequalities

$N_I + j \le \| v_{i,0} + h_j v_{i,j} \| < N_I + j + n$.   (5)

Due to the fact that $v_{i,j} \in \{0, \ldots, n\}^m$, we can always find such $h_j$. Furthermore,
we consider the following finite set

$Y_i = \{ v_{i,0} + n_1 v_{i,1} + \cdots + n_{k_i} v_{i,k_i} \mid 0 \le n_1 < h_1, \ldots, 0 \le n_{k_i} < h_{k_i} \}$.   (6)

It can be easily verified that

$Z_i = Y_i \cup \bigcup_{j=1}^{k_i} Z_{N_I+j}$.
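The decomposition of $Z_i$ into $Y_i$ and the sets $Z_{N_I+j}$ can be checked mechanically on a small instance. The following Python sketch does so inside a bounded box; the offset, the generators, the value $N_I = 5$ and the box size are illustrative assumptions, not values from the lemma, and the helper names are hypothetical.

```python
from itertools import product

def norm(v):
    """Infinity norm of a vector of naturals: its maximum component."""
    return max(v)

def add_scaled(u, c, g):
    """Return u + c*g componentwise."""
    return tuple(x + c * y for x, y in zip(u, g))

def linear_set(offset, gens, bound):
    """All vectors offset + sum_j n_j*gens[j] with every component <= bound."""
    found = set()
    # A coefficient above `bound` would push some component above `bound`,
    # because every generator is a nonzero vector of naturals.
    for coeffs in product(range(bound + 1), repeat=len(gens)):
        v = offset
        for c, gen in zip(coeffs, gens):
            v = add_scaled(v, c, gen)
        if norm(v) <= bound:
            found.add(v)
    return found

def smallest_h(offset, gen, threshold):
    """Smallest h >= 0 such that ||offset + h*gen|| >= threshold."""
    h = 0
    while norm(add_scaled(offset, h, gen)) < threshold:
        h += 1
    return h

def split(offset, gens, max_index, bound):
    """Return the h_j's, Y_i and the sets Z_{N_I+j}, restricted to the box."""
    hs = [smallest_h(offset, gen, max_index + j)
          for j, gen in enumerate(gens, start=1)]
    y_i = set()
    for coeffs in product(*(range(h) for h in hs)):
        v = offset
        for c, gen in zip(coeffs, gens):
            v = add_scaled(v, c, gen)
        if norm(v) <= bound:
            y_i.add(v)
    shifted = [linear_set(add_scaled(offset, h, gen), gens, bound)
               for h, gen in zip(hs, gens)]
    return hs, y_i, shifted

# Illustrative instance: generators with components in {0, ..., 3}, N_I = 5.
offset, gens = (1, 1), [(2, 0), (1, 3)]
hs, y_i, shifted = split(offset, gens, max_index=5, bound=30)
z_i = linear_set(offset, gens, bound=30)
assert z_i == y_i.union(*shifted)
```

Here `smallest_h` picks the least admissible unrolling length, which is one way to satisfy (5); any other $h_j$ within the stated range would serve equally well.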

Now we replace the set of indices $I$ by the set

$I' = (I - \{i\}) \cup \{N_I + 1, \ldots, N_I + k_i\}$,

and the set $Y$ by $Y' = Y \cup Y_i$. We continue the same process by considering
the next index $i$.

We notice that, since we are choosing each vector $x_i \in \mathrm{Pred}(v_{i,0})$ in such
a way that $\|x_i\| \le i$, when we have to choose the vector $x_{N_I+j}$ for a set
$Z_{N_I+j}$ introduced at this stage, by condition (5) we will have at least
one possibility (a vector with one component equal to $N_I + j$ and another
component equal to 1; we remind the reader that, since the given automaton
accepts only nonunary words, all offsets are nonunary). This implies that
after examining all sets $Z_i$ corresponding to the original set $I$, we do not
need to further modify the sets introduced during this process. Hence, this
procedure ends in a finite number of steps.

Furthermore, for each $Z_i$ in the initial representation, we introduced at
most $m$ sets. Hence, the cardinality $N$ of the set of indices resulting at the
end of this process is still polynomial.³

By (5) the components of the offsets which have been added in this
process cannot exceed $N + n$. Hence, it turns out that $m \cdot (N + n)$ is an
upper bound to the components of vectors in $Y_i$. This permits us to conclude
that $p(n) = m \cdot (N + n)$ is an upper bound for all these amounts.⁴ □

Now we are able to consider the case of automata accepting only words 
that are nonunary. 

Theorem 3.3. For each n-state 1NFA over $\Sigma$, accepting a language none of
whose words are unary, there exists a Parikh equivalent 1DFA with a number
of states polynomial in n.



Proof. Let $A$ be the given n-state 1NFA. According to Lemma 3.2 we express
the Parikh image of $L(A)$ as in (2) and, starting from this representation,
we will build a 1DFA $A_{non}$ that is Parikh equivalent to $A$. To this end, we
could apply the following procedure:

1. For each $i \in I$, build a 1DFA $A_i$ such that $\psi(L(A_i)) = Z_i$.

2. From the automata $A_i$'s so obtained, derive a 1DFA $A'$ such that
$\psi(L(A')) = \bigcup_{i \in I} Z_i$.

3. Define a 1DFA $A''$ such that $\psi(L(A'')) = Y$.

4. From $A'$ and $A''$, using the standard construction for the union, build
a 1DFA $A_{non}$ such that $L(A_{non}) = L(A') \cup L(A'')$ and, hence,
$\psi(L(A_{non})) = Y \cup \bigcup_{i \in I} Z_i$, i.e., $A_{non}$ is Parikh equivalent to $A$.

Actually, we will use a variation of this procedure. In particular, steps 1
and 2, to obtain $A'$, are modified as we now explain.

Let us start by considering $i \in I$. First, we handle the generators of $Z_i$.
To this aim, let us consider the function $g : \mathbb{N}^m \to \Sigma^*$ defined by

$g(v) = a_1^{i_1} a_2^{i_2} \cdots a_m^{i_m}$,



³ Since the cardinality of the set of indices, before the transformation,
was $O(n^{3m^2} m^{m^2/2})$ (cf. Note 2), the cardinality $N$ after the transformation is
$O(n^{3m^2} m^{m^2/2+1})$.

⁴ Hence $p(n) = O(n^{3m^2} m^{m^2/2+2})$.

Figure 2: (Left) A construction of the 1DFA $A_W$, where the state $q_u$ is simply
denoted by $u$ for clarity. (Right) A construction of the 1DFA $B_i$ that accepts
$\{w_{i,1}, w_{i,2}, \ldots, w_{i,k_i}\}^*$ for $k_i = 3$. In the construction of the final 1DFA $A_{non}$,
if $w_{i,0} = acb$, then the initial state $q$ of $B_i$ is merged with the state $q_{acb}$
of $A_W$.



for each vector $v = (i_1, \ldots, i_m) \in \mathbb{N}^m$.

Using this function, we map the generators $v_{i,1}, \ldots, v_{i,k_i}$ into the words

$s_{i,1} = g(v_{i,1}),\ s_{i,2} = g(v_{i,2}),\ \ldots,\ s_{i,k_i} = g(v_{i,k_i})$.

It is easy to define an automaton accepting the language $\{s_{i,1}, s_{i,2}, \ldots, s_{i,k_i}\}^*$,
which consists of a start state $q$ with $k_i$ loops labeled with $s_{i,1}, s_{i,2}, \ldots, s_{i,k_i}$,
respectively. The state $q$ is the only accepting state. However, this automaton
is nondeterministic.

To avoid this problem, we modify the language by replacing each $s_{i,j}$,
for $j = 1, \ldots, k_i$, with a Parikh equivalent word $w_{i,j}$ in such a way that
for all pairwise different $j, j'$ the corresponding words $w_{i,j}$ and $w_{i,j'}$ begin
with different letters. This is possible due to the fact that, $v_{i,1}, \ldots, v_{i,k_i}$ being
linearly independent, according to Lemma 2.1 we can find $k_i$ different letters
$a_{t_1}, a_{t_2}, \ldots, a_{t_{k_i}} \in \Sigma$ such that $v_{i,j}[t_j] > 0$ for $j = 1, \ldots, k_i$. We "rotate"
each $s_{i,j}$ by a cyclic shift so that the resulting word, $w_{i,j}$, begins with an
occurrence of the letter $a_{t_j}$. Then $w_{i,j}$ is Parikh equivalent to $s_{i,j}$. For
example, if $s_{i,j} = a_1^{p} a_2^{q} a_3^{r}$ with $q > 0$ and $t_j = 2$, then $w_{i,j}$ should be chosen as $a_2^{q} a_3^{r} a_1^{p}$.
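The rotation step can be illustrated with a short Python sketch. It builds $g(v)$, searches by backtracking for pairwise distinct letters $a_{t_j}$ with $v_{i,j}[t_j] > 0$ (whose existence for linearly independent generators is what Lemma 2.1 provides), and rotates each word accordingly. The function names and the sample generators are hypothetical, chosen only for illustration; indices are 0-based in the code.

```python
def g(v, alphabet):
    """The word a_1^{i_1} a_2^{i_2} ... a_m^{i_m} for v = (i_1, ..., i_m)."""
    return "".join(letter * count for letter, count in zip(alphabet, v))

def pick_distinct_letters(gens, chosen=()):
    """Backtracking search for pairwise distinct indices t_j with gens[j][t_j] > 0.

    Returns a list of indices, or None when no such choice exists."""
    j = len(chosen)
    if j == len(gens):
        return list(chosen)
    for t, count in enumerate(gens[j]):
        if count > 0 and t not in chosen:
            found = pick_distinct_letters(gens, chosen + (t,))
            if found is not None:
                return found
    return None

def rotate(word, letter):
    """Cyclic shift of `word` that starts at the first occurrence of `letter`."""
    pos = word.index(letter)
    return word[pos:] + word[:pos]

def loop_words(gens, alphabet):
    """Words w_{i,1}, ..., w_{i,k_i}: Parikh equivalent to g(v) and with
    pairwise distinct first letters."""
    ts = pick_distinct_letters(gens)
    if ts is None:
        raise ValueError("no choice of pairwise distinct letters exists")
    return [rotate(g(v, alphabet), alphabet[t]) for v, t in zip(gens, ts)]

# Three linearly independent sample generators over the alphabet {a, b, c}.
words = loop_words([(1, 1, 0), (1, 0, 0), (0, 1, 1)], "abc")
print(words)
```

Because the first letters differ, a single state with one loop per word is deterministic, which is exactly what the construction of $B_i$ relies on.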

The construction of a 1DFA $B_i$ with one unique accepting state $q$ that
accepts $\{w_{i,1}, w_{i,2}, \ldots, w_{i,k_i}\}^*$ must be now clear: $q$ with $k_i$ loops labeled
with these respective $k_i$ words (see Fig. 2 (Right)). Furthermore, due to the
limitations deriving from Lemma 3.2, the length of these loops is at most
$mn$, so that this 1DFA contains at most $1 + m(mn - 1)$ states.

Now, we can modify this 1DFA in order to build an automaton $A_i$ recognizing
a language whose Parikh image is $Z_i$. To this aim, it is enough to add a
path which from an initial state, reading a word $w_{i,0}$ with Parikh image $v_{i,0}$,
reaches the state $q$ and then the part accepting $\{w_{i,1}, w_{i,2}, \ldots, w_{i,k_i}\}^*$ that we
already described. Due to the limitations on $v_{i,0}$, this can be done by adding
a polynomial number of states. In particular, we could take $w_{i,0} = g(v_{i,0})$,
thus completing step 1. However, when we have such 1DFAs for all $i \in I$,
by applying the standard construction for the union to them, as in step 2,
the cardinality of $I$ being polynomial in $n$, the resulting 1DFA could have exponentially many
states in $n$, namely it could be too large for our purposes.

To avoid this problem, the automaton $A'$ which should be derived from
steps 1 and 2 is obtained by using a different strategy. We introduce the function
$f : \mathbb{N}^m \to \Sigma^*$ defined as: for $v \in \mathbb{N}^m$, $f(v) = \mathrm{lcs}(g(v))$, where $\mathrm{lcs}$ denotes
the 1-step left circular shift. For example, $f(4, 1, 2, 0, \ldots, 0) = a_1^3 a_2 a_3^2 a_1$. It
can be verified that the 1-step left circular shift endows $f$ with the prefix
property over the nonunary vectors, that is, for any $u, v \in \mathbb{N}^m$ that are
nonunary, if $f(u)$ is a prefix of $f(v)$, then $u = v$. Let

$w_{i,0} = f(x_i)\, g(v_{i,0} - x_i)$,



where $x_i \in \mathrm{Pred}(v_{i,0})$ is given by Lemma 3.2. Clearly, $\psi(w_{i,0}) = v_{i,0}$. We
now consider the finite language

$W = \{ w_{i,0} \mid i \in I \}$.

Because the $x_i$'s are nonunary and $f$ has the prefix property over nonunary
vectors, the language $W$ is prefix-free. We build a (partial) 1DFA that
accepts $W$, which is denoted by $A_W = (Q_W, \Sigma, q_\varepsilon, \delta_W, F_W)$, where:

• $Q_W = \{ q_u \mid u \in \mathrm{Pref}(W) \}$,

• the state $q_\varepsilon$ corresponding to the empty word is the initial state,

• $F_W = \{ q_u \mid u \in W \}$,

• $\delta_W$ is defined as: for $u \in \mathrm{Pref}(W)$ and $a \in \Sigma$, if $ua \in \mathrm{Pref}(W)$, then
$\delta_W(q_u, a) = q_{ua}$, while $\delta_W(q_u, a)$ is undefined otherwise.

See Fig. 2 (Left). Clearly, this automaton accepts $W$. Since the longest word(s) in $W$ is
of length $m \cdot p(n)$, this 1DFA contains at most $1 + |I| \cdot m \cdot p(n) = O(m p^2(n))$
states.
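The pieces just introduced (the shifted encoding $f$, the words $w_{i,0}$ and the prefix-tree automaton $A_W$) can be sketched in Python as follows. The sample offsets and the vectors $x_i$ are illustrative assumptions, and the sketch assumes that $\mathrm{Pred}(v)$ contains only vectors that are componentwise bounded by $v$; the brute-force check of the prefix property covers only small vectors and is not a proof.

```python
from itertools import product

def g(v, alphabet):
    """The word a_1^{i_1} ... a_m^{i_m} for v = (i_1, ..., i_m)."""
    return "".join(letter * count for letter, count in zip(alphabet, v))

def f(v, alphabet):
    """g(v) followed by a 1-step left circular shift."""
    word = g(v, alphabet)
    return word[1:] + word[:1]

def is_nonunary(v):
    """True when at least two components of v are positive."""
    return sum(1 for c in v if c > 0) >= 2

def has_prefix_property(max_component, alphabet):
    """Brute-force check: f(u) is a prefix of f(v) only when u == v."""
    vectors = [v for v in product(range(max_component + 1), repeat=len(alphabet))
               if is_nonunary(v)]
    return all(u == v or not f(v, alphabet).startswith(f(u, alphabet))
               for u in vectors for v in vectors)

def offset_word(offset, x, alphabet):
    """w_{i,0} = f(x_i) g(v_{i,0} - x_i), with x componentwise <= offset."""
    rest = tuple(o - c for o, c in zip(offset, x))
    return f(x, alphabet) + g(rest, alphabet)

def build_trie_dfa(words):
    """Partial DFA A_W: one state per prefix of a word in W."""
    states, delta = {""}, {}
    for w in words:
        for k in range(len(w)):
            delta[(w[:k], w[k])] = w[:k + 1]
            states.add(w[:k + 1])
    return states, delta, "", set(words)

def accepts(dfa, word):
    """Run the partial DFA; a missing transition rejects."""
    _, delta, q, finals = dfa
    for a in word:
        if (q, a) not in delta:
            return False
        q = delta[(q, a)]
    return q in finals

sigma = "abc"
# Hypothetical offsets (2,1,0), (2,2,1) with chosen x_i = (1,1,0), (1,0,1).
W = [offset_word((2, 1, 0), (1, 1, 0), sigma),
     offset_word((2, 2, 1), (1, 0, 1), sigma)]
dfa = build_trie_dfa(W)
```

Each word of $W$ ends in its own state of the trie, which is the property used when the start states of the automata $B_i$ are merged into $A_W$.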

It goes without saying that each accepting state of this 1DFA is only for
one word in $W$, namely, two distinct words in $W$ are accepted by $A_W$ at
two distinct states. Now, based on $A_W$ and the 1DFAs $B_i$ with $i \in I$, we can
build a finite automaton that accepts the language $\bigcup_{i \in I} w_{i,0} L(B_i)$ without
introducing any new state. This is simply done by merging $q_{w_{i,0}}$ with the
start state of $B_i$. Given an input $u$, the resulting automaton $A'$ simulates
the 1DFA $A_W$, looking for a prefix $w$ of $u$ such that $w \in W$. When such
a prefix is found, $A'$ starts simulating $B_i$ on the remaining suffix $z$, where
$i$ is the index such that $w = w_{i,0}$. Since $W$ is prefix-free, we need only to
consider one decomposition of the input as $u = wz$. This implies that $A'$
is deterministic. Finally, we observe that $A'$ contains at most $O(m p^2(n)) +
|I|(1 + m(mn - 1)) = O(m p(n)(p(n) + mn))$ states, i.e., a number which is
polynomial in $n$.

We now sketch the construction of a 1DFA $A''$ accepting a language $L_Y$
whose Parikh image is $Y$ (step 3). We just take $L_Y = \{ g(v) \mid v \in Y \}$.
Let $M$ be the maximum of the components of vectors in $Y$. With each
$v \in \{0, \ldots, M\}^m$, we associate a state $q_v$ which is reachable from the initial
state by reading the word $g(v)$. Final states are those corresponding to
vectors in $Y$. The automaton $A''$ so obtained has $(M + 1)^m \le (p(n) + 1)^m$
states, a number polynomial in $n$.

Finally, by applying the standard construction for the union (step 4),
from automata $A'$ and $A''$ we obtain the 1DFA $A_{non}$ Parikh equivalent to the
given 1NFA $A$, with number of states polynomial in $n$.⁵ □

We now switch to the general case. We prove that for each input alphabet
the state cost of the conversion of 1NFAs into Parikh equivalent 1DFAs is the
same as for the unary alphabet.

Theorem 3.4. For each n-state 1NFA over $\Sigma$, there exists a Parikh equivalent
1DFA with $e^{O(\sqrt{n \cdot \ln n})}$ states. Furthermore, this cost is tight.



Proof. According to Lemma 2.5, from a given n-state 1NFA $A$ with input
alphabet $\Sigma = \{a_1, a_2, \ldots, a_m\}$, we build a 1NFA $A_0$ with $n(m + 1) + 1$ states
that accepts the nonunary part of $L(A)$ and $m$ n-state 1NFAs $A_1, A_2, \ldots, A_m$
that accept the unary parts of $L(A)$. Using Theorem 2.4, for $i = 1, \ldots, m$,
we convert $A_i$ into an equivalent 1DFA $A'_i$ with $e^{O(\sqrt{n \cdot \ln n})}$ states. We can
assume that the state sets of the resulting automata are pairwise disjoint.



⁵ Assuming $p(n) \ge mn$, the number of states of $A'$ is $O(m p^2(n))$. Hence, the number
of states of $A_{non}$ is $O(m p^{m+2}(n))$. Using $p(n) = O(n^{3m^2} m^{m^2/2+2})$ (cf. Note 4),
we conclude that the number of states of $A_{non}$ is $O(m\, n^{3m^2(m+2)} m^{(m^2/2+2)(m+2)}) =
O(n^{3m^3+6m^2} m^{m^3/2+m^2+2m+5})$. Hence, it is polynomial in the number of states of the
original 1NFA $A$, but exponential in the alphabet size $m$.

We define $A_u$ that accepts $\{ w \in L(A) \mid w \text{ is unary} \}$ consisting of one
copy of each of these 1DFAs and a new state $q_s$, which is its start state. In
reading the first letter $a_i$ of an input, $A_u$ transits from $q_s$ to the state $q$ in
the copy of $A'_i$ if $A'_i$ transits from its start state to $q$ on $a_i$ (such $q$ is unique
because $A'_i$ is deterministic). These transitions from $q_s$ do not introduce
any nondeterminism because $A'_1, \ldots, A'_m$ are defined over pairwise distinct
letters. After thus entering the copy, $A_u$ merely simulates $A'_i$. The start
state $q_s$ should be also an accepting state if and only if $\varepsilon \in L(A'_i)$ for some
$1 \le i \le m$. Being thus built, $A_u$ accepts exactly all the unary words in $L(A)$
and contains at most $m \cdot e^{O(\sqrt{n \cdot \ln n})} + 1$ states.
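As a concrete illustration of how $A_u$ is assembled, here is a minimal Python sketch. Each unary 1DFA is given, hypothetically, as a triple (start state, successor map, accepting states) over its own letter; the representation and the two sample automata are assumptions made for the example only.

```python
def combine_unary_dfas(dfas):
    """Build A_u from unary DFAs over pairwise distinct letters.

    `dfas` maps a letter to (start, step, finals), where step[q] is the
    successor of q on that letter. Returns (start, delta, finals)."""
    q_s = "q_s"                      # the new start state
    delta, finals = {}, set()
    for letter, (start, step, accepting) in dfas.items():
        # First letter: jump from q_s into the copy of the DFA for this letter.
        delta[(q_s, letter)] = (letter, step[start])
        # Inside a copy, only the copy's own letter has transitions.
        for q, successor in step.items():
            delta[((letter, q), letter)] = (letter, successor)
        finals.update((letter, q) for q in accepting)
        if start in accepting:
            finals.add(q_s)          # the empty word is accepted by some copy
    return q_s, delta, finals

def run(automaton, word):
    """Run the partial DFA; a missing transition rejects."""
    q, delta, finals = automaton
    for a in word:
        if (q, a) not in delta:
            return False
        q = delta[(q, a)]
    return q in finals

def state_count(automaton):
    """Number of states that occur in the transition table."""
    _, delta, _ = automaton
    return len({q for q, _ in delta} | set(delta.values()))

# Sample inputs: an even number of a's, and b^k with k = 1 (mod 3).
A_u = combine_unary_dfas({
    "a": (0, {0: 1, 1: 0}, {0}),
    "b": (0, {0: 1, 1: 2, 2: 0}, {1}),
})
```

In this sample, `state_count(A_u)` is the sum of the sizes of the two copies plus one, matching the count given in the proof.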

On the other hand, for the nonunary part of $L(A)$, using Theorem 3.3, we
convert $A_0$ into a Parikh equivalent 1DFA $A_n$ with a number of states $r(n)$,
polynomial in $n$. The standard product construction is applied to $A_u$ and
$A_n$ in order to build a 1DFA accepting $L(A_u) \cup L(A_n)$. The number of states
of the 1DFA thus obtained is bounded by the product

$e^{O(\sqrt{n \cdot \ln n})} \cdot r(n) = e^{O(\sqrt{n \cdot \ln n})} \cdot e^{O(\ln n)} = e^{O(\sqrt{n \cdot \ln n} + \ln n)}$,

which is still bounded by $e^{O(\sqrt{n \cdot \ln n})}$.



Finally, we observe that by Theorem 2.4 in the unary case $e^{\Theta(\sqrt{n \cdot \ln n})}$ is
the tight cost of the conversion from n-state 1NFAs to 1DFAs. This implies
that the upper bound we obtained here cannot be reduced.

This completes the proof of Theorem 3.4. □



We conclude this section with some observations. We proved that, for a fixed
alphabet $\Sigma$, the state cost of the conversion of n-state 1NFAs into Parikh
equivalent 1DFAs is polynomial in $n$, in the case each word in the accepted
language is nonunary. Otherwise, the cost is exponential in $\sqrt{n \cdot \ln n}$. A
closer inspection of our proofs shows that these costs are exponential in the
size of the alphabet (see Notes 2 and 5).

While the cost of the conversion in the general case has been proved
to be tight (see Theorem 3.4), it would be interesting to see whether or
not the cost of the conversion in the case of n-state 1NFAs accepting only
nonunary words (Theorem 3.3) could be further reduced. In this respect,
we point out that $n$ is a lower bound. In fact, a smaller cost would imply
that any given 1NFA (or 1DFA) $B_0$ could be converted into a smaller
Parikh equivalent 1DFA $B_1$ which, in turn, could be further converted into a
smaller Parikh equivalent 1DFA $B_2$, and so on. In this way, from $B_0$ we could
build an arbitrarily long sequence of automata $B_0, B_1, B_2, \ldots$, all of them
Parikh equivalent to $B_0$, and such that for each $i > 0$, $B_i$ would be smaller
than $B_{i-1}$. This clearly does not make sense. With a similar argument, we
can also conclude that even the cost of the conversion of n-state 1NFAs into
Parikh equivalent 1NFAs must be at least $n$.



4 From CFGs to Parikh equivalent 1DFAs

In this section we extend the results of Section 3 to the conversion of CFGs in
Chomsky normal form into Parikh equivalent 1DFAs. Actually, Theorem 3.3
will play an important role in order to obtain the main result of this section.
The other important ingredient is the following result proven in 2011 by
Esparza, Ganty, Kiefer and Luttenberger [EGKL11], which gives the cost of
the conversion of CNFGs into Parikh equivalent 1NFAs.

Theorem 4.1 ([EGKL11]). For each CNFG with h variables, there exists a
Parikh equivalent 1NFA with $O(4^h)$ states.



We point out that the upper bound in Theorem 4.1 does not depend on
the cardinality of the input alphabet.

By combining Theorem 4.1 with the main result of the previous section,
i.e., Theorem 3.4, we can immediately obtain a double exponential upper
bound in $h$ for the size of 1DFAs Parikh equivalent to CNFGs with $h$ variables.
However, we can do better. In fact, we show how to reduce the upper
bound to a single exponential in a polynomial of $h$. We obtain this result by
proceeding as in the case of finite automata: we split the language defined
by the given grammar into the unary and nonunary parts, we make separate
conversions, and finally we combine the results.

As in Section 3, from now on let us fix the alphabet $\Sigma = \{a_1, a_2, \ldots, a_m\}$.
We start by considering the nonunary part. By combining Theorem 4.1 with
Theorem 3.3 we obtain:

Theorem 4.2. For each h-variable CNFG with terminal alphabet $\Sigma$, generating
a language none of whose words are unary, there exists a Parikh
equivalent 1DFA with $2^{O(h)}$ states.

Proof. First, according to Theorem 4.1, we can transform the grammar into
a Parikh equivalent 1NFA with $O(4^h)$ states. Then, using Theorem 3.3,
we convert the resulting automaton into a Parikh equivalent 1DFA, with a
number of states polynomial in $4^h = 2^{2h}$, hence exponential in $h$. □

Now, we switch to the general case. 

Theorem 4.3. For each h-variable CNFG with terminal alphabet $\Sigma$, there
exists a Parikh equivalent 1DFA with at most $2^{O(h^2)}$ states.

Proof. Let us denote the given CNFG by $G = (V, \Sigma, P, S)$, where $|V| = h$.

In the case $m = 1$ (unary alphabet), one can employ Theorem 2.3 (note
that, over a unary alphabet, two languages $L_1, L_2$ are Parikh equivalent if
and only if they are equivalent). Hence, from now on we assume $m \ge 2$.

Let us give an outline of our construction first: 

1. From $G$, we first create CNFGs $G_0, G_1, \ldots, G_m$ such that $G_0$ generates
the nonunary part of $L(G)$ and $G_1, G_2, \ldots, G_m$ generate the unary
parts.

2. The grammars $G_1, G_2, \ldots, G_m$ are converted into respectively equivalent
unary 1DFAs $A_1, A_2, \ldots, A_m$. From these 1DFAs, a 1DFA $A_{unary}$
accepting the set of all unary words in $L(G)$ is constructed.

3. The grammar $G_0$ is converted into a Parikh equivalent 1DFA $A_{non}$.

4. Finally, from $A_{unary}$ and $A_{non}$, a 1DFA that accepts the union of
$L(A_{unary})$ and $L(A_{non})$ is obtained.

Observe that $L(A_{unary}) = \{ w \in L(G) \mid w \text{ is unary} \}$ and $L(A_{non})$ is Parikh
equivalent to $L(G_0) = \{ w \in L(G) \mid w \text{ is not unary} \}$. Thus, the 1DFA which
is finally constructed by this procedure is Parikh equivalent to the given
grammar $G$.

We already have all the tools we need to implement each step in the 
above construction. 



1. We can obtain grammars $G_0, G_1, \ldots, G_m$ according to Lemma 2^. In
particular, $G_0$ has $mh - m + 1$ variables, while each of the remaining
grammars has $h$ variables.

2. According to Theorem 2.3, for $i = 1, \ldots, m$, grammar $G_i$ is converted
into a 1DFA $A_i$ with less than $2^{h^2}$ states. Using the same strategy
presented in the proof of Theorem 3.4, from $A_1, \ldots, A_m$, we define
$A_{unary}$ consisting of one copy of each of these 1DFAs and a new state
$q_s$, which is its start state. Hence, the number of states of $A_{unary}$ does
not exceed $m 2^{h^2}$.

3. This step is done using Theorem 4.2. The number of the states of the
resulting 1DFA $A_{non}$ is exponential in the number of the variables of
the grammar $G_0$ and, hence, exponential in $h$.

4. The final 1DFA can be obtained as the product of the two automata $A_{unary}$
and $A_{non}$. Considering the bounds obtained in Steps 2 and 3, we conclude
that the number of states is exponential in $h^2$.⁶ □



We point out that in [LP12] a result close to Theorem 4.3 has been proved
in the case of CNFGs generating letter bounded languages, i.e., subsets of
$a_1^* a_2^* \cdots a_m^*$. In particular, an upper bound exponential in a polynomial in
$h$ has been obtained. However, the degree of the polynomial is, in turn, a
polynomial in the size $m$ of the alphabet. Here, in our Theorem 4.3, the
degree is 2. Hence, it does not depend on $m$.

We observe that in [PSW02, Thm. 7] it was proved that there is a constant
$c > 0$ such that for infinitely many $h > 0$ there exists a CNFG with
$h$ variables generating a unary language such that each equivalent 1DFA
requires at least $2^{ch^2}$ states. This implies that the upper bound in Theorem 4.3
cannot be improved.



Even the bound given in Theorem 4.2, for languages consisting only of
nonunary words, cannot be improved by replacing the exponential in $h$ by
a more slowly increasing function. This can be shown by adapting a standard
argument from the unary case (e.g., [PSW02, Thm. 5]). For any integer
$h > 3$, consider the grammar $G$ with variables $A, B, A_0, A_1, \ldots, A_{h-3}$, and
productions

$A \to a$, $B \to b$, $A_0 \to AB$, $A_j \to A_{j-1} A_{j-1}$, for $j = 1, \ldots, h-3$.

An easy induction shows that, for $j = 1, \ldots, h - 3$, the only word which is
generated from $A_j$ is $(ab)^{2^j}$. Hence, by choosing $A_{h-3}$ as start symbol, we
have $L(G) = \{(ab)^H\}$, with $H = 2^{h-3}$. An immediate pumping argument
shows that each 1DFA (or even 1NFA) with less than $2H + 1$ states accepting
a word of length $2H$ should also accept some words of length $< 2H$. Since
$L(G)$ contains only the word $(ab)^H$, it turns out that each 1DFA accepting a
language Parikh equivalent to $L(G)$ requires $2H + 1$ states, namely a number
exponential in $h$.
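The witness grammar is small enough to expand explicitly. The following Python sketch builds its productions for a given $h$ and derives the unique generated word; the dictionary representation and the function names are assumptions made for illustration.

```python
def witness_grammar(h):
    """Productions of the grammar with variables A, B, A_0, ..., A_{h-3}.

    A terminal production maps a variable to a string, a binary
    production maps it to a pair of variables."""
    if h < 3:
        raise ValueError("the grammar needs at least the variables A, B, A_0")
    productions = {"A": "a", "B": "b", "A0": ("A", "B")}
    for j in range(1, h - 2):
        productions[f"A{j}"] = (f"A{j - 1}", f"A{j - 1}")
    return productions, f"A{h - 3}"

def generated_word(productions, symbol):
    """Every variable has exactly one production, so it derives one word."""
    rhs = productions[symbol]
    if isinstance(rhs, str):
        return rhs
    return "".join(generated_word(productions, x) for x in rhs)

h = 6
productions, start = witness_grammar(h)
word = generated_word(productions, start)
H = 2 ** (h - 3)
# The grammar has h variables and generates only (ab)^H, of length 2H.
assert len(productions) == h and word == "ab" * H
```

The length $2H = 2^{h-2}$ of the only generated word is what forces $2H + 1$ states in any 1DFA for a Parikh equivalent language.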

⁶ We briefly discuss how the upper bounds depend on $m$, the alphabet size. Using the
estimation of the cost of the conversions in Theorem 3.3 (see Note 5) and observing that
the grammar $G_0$ in the previous construction has less than $mh$ variables, we can conclude
that the automaton $A_{non}$ has $2^{O(hm^4)}$ states. Hence, the number of the states of the 1DFA
finally obtained by the construction given in the proof of Theorem 4.3 is $2^{O(h^2 + hm^4)}$.

5 Conversions into Parikh equivalent 2DFAs

In this section we study the conversions of 1NFAs and CNFGs into Parikh
equivalent two-way deterministic automata. In the previous sections, for
the conversions into one-way deterministic automata, we observed that the
unary parts are the most expensive. However, the costs of the conversions
of unary 1NFAs and CNFGs into 2DFAs are smaller than the costs of the
corresponding conversions into 1DFAs. This allows us to prove that, in the
general case, the costs of the conversions of 1NFAs and CNFGs into Parikh
equivalent 2DFAs are smaller than the costs of the corresponding conversions
into 1DFAs.

Let us start by presenting the following result, which derives from [Chr86]:

Theorem 5.1. For each n-state unary 1NFA there exists an equivalent halting
2DFA with $n^2 + 1$ states.

Proof. For the sake of completeness, we present a proof which is essentially
the same as the one given by Chrobak [Chr86, Thm. 6.2] where, however, the obtained
upper bound was $O(n^2)$. Then we will explain why the big-O in the upper
bound can be removed.

First of all, each n-state unary 1NFA $A$ can be converted into an equivalent
1NFA $A_c$ in a special form, which is known as Chrobak normal form [Chr86,
Lemma 4.3], consisting of a deterministic path which starts from the initial
state, and $k \ge 0$ disjoint deterministic cycles. The number of states in the
path is $s = O(n^2)$, while the total number of states in the cycles is $r \le n$.
From the last state of the path there are $k$ outgoing edges, each one of them
reaching a fixed state on a different cycle. Hence, on each input of length
$\ell \ge s$, the computation visits all the states on the initial path, until the last
one where the only nondeterministic choice is taken, moving to one of the
cycles, where the remaining part of the input is examined. However, if $\ell < s$
then the computation ends in a state on the initial path, without reaching any
loop. (In the special case $k = 0$ the accepted language is finite.)

A 2DFA $B$ can simulate the 1NFA $A_c$ in Chrobak normal form, traversing
the input word at most $k + 1$ times. In the first traversal, the automaton
checks whether or not the input length is $< s$. If this is the case, then the
automaton accepts or rejects according to the corresponding state on the
initial path of $A_c$. Otherwise, it moves to the right endmarker. This part
can be implemented with $s + 1$ states ($s$ states for the simulation of the
initial path, plus one more state to move to the right endmarker). From the
right endmarker, the automaton traverses the input leftward, by simulating
the first cycle of $A_c$ from a suitable state (which is fixed, only depending
on $s$ and on the cycle length). If the left endmarker is reached in a state
which simulates a final state in the cycle then the automaton $B$ moves to
the final state $q_f$ and accepts, otherwise it traverses the input rightward,
simulating the 2nd cycle of $A_c$, and so on. Hence, in the $(i + 1)$th traversal
of the input, $1 \le i \le k$, the $i$th cycle is simulated. So the number of states
used to simulate the cycles is equal to the total number of states in the
cycles, namely $r$. Considering the final state $q_f$, we conclude that $B$ can be
implemented with $s + r + 2 = O(n^2)$ states.
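The sweeping strategy can be made concrete with a short Python sketch that compares it against direct acceptance by the automaton in Chrobak normal form. The encoding is an assumption made for the example: the path is a list of booleans (entry $\ell$ says whether the path state reached after $\ell < s$ symbols is accepting), and each cycle is a list of booleans indexed from its entry state, which is reached after exactly $s$ symbols.

```python
def chrobak_accepts(length, path_accept, cycles):
    """Direct acceptance of a^length by the normal-form automaton."""
    s = len(path_accept)
    if length < s:
        return path_accept[length]
    return any(acc[(length - s) % len(acc)] for acc in cycles)

def sweep_accepts(length, path_accept, cycles):
    """Acceptance of a^length following the traversals of the 2DFA B."""
    s = len(path_accept)
    # First traversal: follow the path, one state per symbol read.
    pos = 0
    for _ in range(length):
        pos += 1
        if pos == s:
            break        # path exhausted: one extra state walks to the endmarker
    if pos < s:
        return path_accept[pos]
    # One further traversal per cycle; the direction alternates in B, which
    # does not matter here because the input is unary.
    for acc in cycles:
        state = (-s) % len(acc)   # fixed start, depends only on s and cycle length
        for _ in range(length):
            state = (state + 1) % len(acc)
        if acc[state]:
            return True
    return False

def two_way_state_count(path_accept, cycles):
    """s + 1 states for the first traversal, r for the cycles, 1 final state."""
    return len(path_accept) + 1 + sum(len(acc) for acc in cycles) + 1

# Sample normal form: a path with s = 3 states and two cycles with r = 5 states.
path_accept = [True, False, False]
cycles = [[True, False], [False, False, True]]
assert all(chrobak_accepts(n, path_accept, cycles) ==
           sweep_accepts(n, path_accept, cycles) for n in range(60))
```

The start state of each cycle traversal is chosen so that, after reading the whole input, the simulated state is the one the cycle would be in after $\ell - s$ steps from its entry state.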

Finally, we point out that finer estimations for the number of the states
on the initial path and in the loops of $A_c$ have been found. In [Gef07], it was
proved that the number $s$ of the states in the initial path is bounded by
$n^2 - 2$ and the sum $r$ of the numbers of the states in the cycles is bounded by
$n - 1$.⁷ The first bound has been further reduced in [Gaw11] to $s \le n^2 - n$.
This allows us to conclude that the 2DFA $B$ can be obtained with at most
$n^2 + 1$ states. □



The upper bound given in Theorem 5.1 is asymptotically tight. As
proven in [Chr86, Thm. 6.3], for each integer $n$ there exists an n-state unary
1NFA such that any equivalent 2DFA requires $\Omega(n^2)$ states.

By combining Theorem 5.1 with the bound for the transformation of
unary CNFGs into 1NFAs given in Theorem 2.3 we immediately obtain the
following bound.

Theorem 5.2. For each h-variable unary CNFG there exists an equivalent
halting 2DFA with at most $(2^{2h-1} + 1)^2 + 1$ states.

We now have the tools for studying the conversions of 1NFAs and CFGs
into Parikh equivalent 2DFAs. Let us start with the first conversion.

Theorem 5.3. For each n-state 1NFA there exists a Parikh equivalent 2DFA
with a number of states polynomial in n.



Proof. We use the same technique as in the proof of Theorem 3.4, by
splitting the language accepted by the given 1NFA $A$ into its unary and
nonunary parts, as explained in Lemma 2.5. Each unary part is accepted
by a 1NFA with $n$ states. According to Theorem 5.1, this gives us $m$
2DFAs $B_1, B_2, \ldots, B_m$, accepting the unary parts, each one of them has



⁷ Actually, there is an exception: if the given 1NFA $A$ is just one cycle of $n$ states then $A$
is already a 1DFA. If it is minimal, then in any equivalent 1NFA we cannot have a cycle
with less than $n$ states which is useful to accept some input. However, in this degenerate
case, Theorem 5.1 is trivially true, without making use of the Chrobak normal form.

at most $n^2 + 1$ states, where $m$ is the cardinality of the input alphabet
$\Sigma = \{a_1, a_2, \ldots, a_m\}$.

For the nonunary part we have a 1NFA with $n(m + 1) + 1$ states and,
according to Theorem 3.3, a Parikh equivalent 1DFA $B_0$ with a number of
states polynomial in $n(m + 1) + 1$ and, hence, in $n$.

Finally, as explained in Section 2.1, we can build a 2DFA $B$ such that
$L(B) = L(B_0) \cup L(B_1) \cup \cdots \cup L(B_m)$. Hence, $B$ is Parikh equivalent to the
given 1NFA $A$ and its number of states is polynomial in $n$.⁸ □

Now, we consider the conversion of CFGs. 

Theorem 5.4. For each h-variable CNFG there exists a Parikh equivalent
2DFA with $2^{O(h)}$ states.

Proof. Even in this case, the construction is obtained by adapting the
corresponding conversion into 1DFAs (Theorem 4.3). In particular, the
construction uses the same steps 1 to 4 given in that proof, with some modifications in
steps 2 and 4, which are replaced by the following ones:

2'. The grammars $G_1, G_2, \ldots, G_m$ are converted into respectively equivalent
unary 2DFAs $A'_1, A'_2, \ldots, A'_m$.

4'. Finally, from $A_{non}, A'_1, A'_2, \ldots, A'_m$, a 2DFA accepting the language
$L(A_{non}) \cup L(A'_1) \cup L(A'_2) \cup \cdots \cup L(A'_m)$ is obtained.

Clearly, the 2DFA resulting from this procedure is Parikh equivalent to the
original grammar $G$. The costs of steps 1 and 3 have been discussed in the
proof of Theorem 4.3. For the remaining steps:

2'. According to Theorem 5.2, for $i = 1, \ldots, m$, the 2DFA $A'_i$ has at most
$(2^{2h-1} + 1)^2 + 1$ states.

4'. We use the construction presented at the end of Section 2 to obtain
a 2DFA whose number of states is the sum of the numbers of the states
of $A_{non}, A'_1, A'_2, \ldots, A'_m$, hence $2^{O(h)}$.⁹ □



⁸ By making the same considerations as in Notes 4 and 5 we can obtain an $O(m^3)$
bound for the degree of the polynomial.

⁹ Explicitly mentioning the dependency on the alphabet size $m$, we can give a $2^{O(hm^4)}$
bound. This derives from the size of the automaton $A_{non}$ (cf. Note 6).

6 Conclusion

We proved that the state cost of the conversion of n-state 1NFAs into Parikh
equivalent 1DFAs is $e^{\Theta(\sqrt{n \cdot \ln n})}$. This is the same cost as that of the conversion of
unary 1NFAs into equivalent 1DFAs. Since in the unary case Parikh equivalence
is just equivalence, this result can be seen as a generalization of the
Chrobak conversion [Chr86] to the nonunary case. More surprisingly, such
a cost is due to the unary parts of the languages. In fact, as shown in Theorem 3.3,
for each n-state 1NFA accepting a language which does not
contain any unary word there exists a Parikh equivalent 1DFA with polynomially
many states. Hence, while for the transformation from 1NFAs to
equivalent 1DFAs we need at least two different symbols to prove the exponential
gap from $n$ to $2^n$ states and we have a smaller gap in the unary case,
for Parikh equivalence the worst case is due only to unary words.



Even in the proof of our result for CFGs (Theorem 4.3), the separation 
between the unary and nonunary parts was crucial. Also in this case, it 
turns out that the most expensive part is the unary one. 

On the other hand, in our conversions into Parikh equivalent 2DFAs, the
most expensive part turns out to be the nonunary one.